Table of Contents¶

  1. Problem Statement
    • 1.1 Introduction
    • 1.2 Data source and data set
  2. Load the Packages and Data
  3. Data Profiling
    • 3.1 Understanding the Dataset
    • 3.2 Pre Profiling
    • 3.3 Preprocessing
    • 3.4 Post Profiling
  4. Questions
    • 4.1 How has the number of movies released per year changed over the last 20-30 years?
    • 4.2 Comparison of TV shows vs. movies.
    • 4.3 What is the best time to launch a TV show?
    • 4.4 Analysis of actors/directors of different types of shows/movies.
    • 4.5 Does Netflix have more focus on TV shows than movies in recent years?
    • 4.6 Understanding what content is available in different countries
    • 4.8 Distribution of Price
  5. Conclusions

1. Problem Statement¶

Analyze the data and generate insights that could help Netflix decide which types of shows and movies to produce and how to grow the business in different countries.

1.1. Introduction¶

This exploratory data analysis practices the Python skills learned so far on a structured dataset: loading, inspecting, wrangling, exploring, and drawing conclusions from data. Each step is accompanied by observations that explain how to approach the dataset. Some of the questions raised by these observations are answered in the notebook for reference, though not all of them are explored in the analysis.

About Data¶

This dataset contains data collected from Netflix on TV shows and movies from the years 2008 to 2021; the release years of the titles themselves range from 1925 to 2021.¶

  • show_id: Unique identifier of the movie or TV show
  • type: Either Movie or TV Show
  • title: Title of the movie or TV show
  • director: Director(s) of the movie or TV show, comma-separated
  • cast: Cast members of the movie or TV show, comma-separated
  • country: Country or countries associated with the movie or TV show, comma-separated
  • date_added: Date when the movie or TV show was added to Netflix
  • release_year: Year when the movie or TV show was released
  • rating: Maturity rating of the movie or TV show (e.g. PG-13, TV-MA), which indicates the intended audience
  • duration: Duration of the movie in minutes, or of the TV show in seasons
  • listed_in: Genre(s) of the movie or TV show, comma-separated
  • description: Short description of the movie or TV show

1.2. Data source and dataset¶

a. How was it collected?

  • Name: "Netflix Data"
  • Sponsoring Organization: Netflix
  • Year: 2021
  • Description: "This is a case study of Netflix Movies and TV shows from 1925 to 2021"

2. Import Packages¶

In [178]:
%%time
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import pandas_profiling as prf
%matplotlib inline
Wall time: 5.99 ms

Loading Dataset¶

In [179]:
netflix = pd.read_csv(r"C:\Users\modem\Downloads\netflix_data.csv")
In [180]:
netflix
Out[180]:
Unnamed: 0 show_id type title director cast country date_added release_year rating duration listed_in description
0 0 s1 Movie Dick Johnson Is Dead Kirsten Johnson NaN United States 25-Sep-21 2020 PG-13 90 min Documentaries As her father nears the end of his life, filmm...
1 1 s2 TV Show Blood & Water NaN Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... South Africa 24-Sep-21 2021 TV-MA 2 Seasons International TV Shows, TV Dramas, TV Mysteries After crossing paths at a party, a Cape Town t...
2 2 s3 TV Show Ganglands Julien Leclercq Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... NaN 24-Sep-21 2021 TV-MA 1 Season Crime TV Shows, International TV Shows, TV Act... To protect his family from a powerful drug lor...
3 3 s4 TV Show Jailbirds New Orleans NaN NaN NaN 24-Sep-21 2021 TV-MA 1 Season Docuseries, Reality TV Feuds, flirtations and toilet talk go down amo...
4 4 s5 TV Show Kota Factory NaN Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... India 24-Sep-21 2021 TV-MA 2 Seasons International TV Shows, Romantic TV Shows, TV ... In a city of coaching centers known to train I...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
8802 8802 s8803 Movie Zodiac David Fincher Mark Ruffalo, Jake Gyllenhaal, Robert Downey J... United States 20-Nov-19 2007 R 158 min Cult Movies, Dramas, Thrillers A political cartoonist, a crime reporter and a...
8803 8803 s8804 TV Show Zombie Dumb NaN NaN NaN 1-Jul-19 2018 TV-Y7 2 Seasons Kids' TV, Korean TV Shows, TV Comedies While living alone in a spooky town, a young g...
8804 8804 s8805 Movie Zombieland Ruben Fleischer Jesse Eisenberg, Woody Harrelson, Emma Stone, ... United States 1-Nov-19 2009 R 88 min Comedies, Horror Movies Looking to survive in a world taken over by zo...
8805 8805 s8806 Movie Zoom Peter Hewitt Tim Allen, Courteney Cox, Chevy Chase, Kate Ma... United States 11-Jan-20 2006 PG 88 min Children & Family Movies, Comedies Dragged from civilian life, a former superhero...
8806 8806 s8807 Movie Zubaan Mozez Singh Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan... India 2-Mar-19 2015 TV-14 111 min Dramas, International Movies, Music & Musicals A scrappy but poor boy worms his way into a ty...

8807 rows × 13 columns

In [181]:
netflix_copy = netflix.copy()
In [182]:
netflix_copy.head()
Out[182]:
Unnamed: 0 show_id type title director cast country date_added release_year rating duration listed_in description
0 0 s1 Movie Dick Johnson Is Dead Kirsten Johnson NaN United States 25-Sep-21 2020 PG-13 90 min Documentaries As her father nears the end of his life, filmm...
1 1 s2 TV Show Blood & Water NaN Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... South Africa 24-Sep-21 2021 TV-MA 2 Seasons International TV Shows, TV Dramas, TV Mysteries After crossing paths at a party, a Cape Town t...
2 2 s3 TV Show Ganglands Julien Leclercq Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... NaN 24-Sep-21 2021 TV-MA 1 Season Crime TV Shows, International TV Shows, TV Act... To protect his family from a powerful drug lor...
3 3 s4 TV Show Jailbirds New Orleans NaN NaN NaN 24-Sep-21 2021 TV-MA 1 Season Docuseries, Reality TV Feuds, flirtations and toilet talk go down amo...
4 4 s5 TV Show Kota Factory NaN Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... India 24-Sep-21 2021 TV-MA 2 Seasons International TV Shows, Romantic TV Shows, TV ... In a city of coaching centers known to train I...

3. Data Profiling¶

3.1 Understanding the dataset

In [183]:
netflix.shape
Out[183]:
(8807, 13)
In [184]:
netflix.columns
Out[184]:
Index(['Unnamed: 0', 'show_id', 'type', 'title', 'director', 'cast', 'country',
       'date_added', 'release_year', 'rating', 'duration', 'listed_in',
       'description'],
      dtype='object')
In [185]:
netflix.describe()
Out[185]:
Unnamed: 0 release_year
count 8807.000000 8807.000000
mean 4403.000000 2014.180198
std 2542.506244 8.819312
min 0.000000 1925.000000
25% 2201.500000 2013.000000
50% 4403.000000 2017.000000
75% 6604.500000 2019.000000
max 8806.000000 2021.000000
In [186]:
netflix.describe(include='all')
Out[186]:
Unnamed: 0 show_id type title director cast country date_added release_year rating duration listed_in description
count 8807.000000 8807 8807 8807 6173 7982 7976 8797 8807.000000 8803 8804 8807 8807
unique NaN 8807 2 8804 4528 7692 746 1767 NaN 17 220 514 8775
top NaN s1 Movie 15-Aug Rajiv Chilaka David Attenborough United States 1-Jan-20 NaN TV-MA 1 Season Dramas, International Movies Paranormal activity at a lush, abandoned prope...
freq NaN 1 6131 2 19 19 2818 109 NaN 3207 1793 362 4
mean 4403.000000 NaN NaN NaN NaN NaN NaN NaN 2014.180198 NaN NaN NaN NaN
std 2542.506244 NaN NaN NaN NaN NaN NaN NaN 8.819312 NaN NaN NaN NaN
min 0.000000 NaN NaN NaN NaN NaN NaN NaN 1925.000000 NaN NaN NaN NaN
25% 2201.500000 NaN NaN NaN NaN NaN NaN NaN 2013.000000 NaN NaN NaN NaN
50% 4403.000000 NaN NaN NaN NaN NaN NaN NaN 2017.000000 NaN NaN NaN NaN
75% 6604.500000 NaN NaN NaN NaN NaN NaN NaN 2019.000000 NaN NaN NaN NaN
max 8806.000000 NaN NaN NaN NaN NaN NaN NaN 2021.000000 NaN NaN NaN NaN

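describe(include='all') gives, for each column, the non-null count, the number of unique values, the most frequent value (top) and its frequency (freq).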

In [187]:
netflix.sort_values(by=['release_year'],ascending=False).head(5)
Out[187]:
Unnamed: 0 show_id type title director cast country date_added release_year rating duration listed_in description
693 693 s694 Movie Ali & Ratu Ratu Queens Lucky Kuswandi Iqbaal Ramadhan, Nirina Zubir, Asri Welas, Tik... NaN 17-Jun-21 2021 TV-14 101 min Comedies, Dramas, International Movies After his father's passing, a teenager sets ou...
781 781 s782 Movie Black Holes | The Edge of All We Know Peter Galison NaN NaN 2-Jun-21 2021 TV-14 99 min Documentaries Follow scientists on their quest to understand...
762 762 s763 Movie Sweet & Sour Lee Kae-byeok Jang Ki-yong, Chae Soo-bin, Jung Soo-jung South Korea 4-Jun-21 2021 TV-14 103 min Comedies, International Movies, Romantic Movies Faced with real-world opportunities and challe...
763 763 s764 TV Show Sweet Tooth NaN Nonso Anozie, Christian Convery, Adeel Akhtar,... United States 4-Jun-21 2021 TV-14 1 Season TV Action & Adventure, TV Dramas, TV Sci-Fi & ... On a perilous adventure across a post-apocalyp...
764 764 s765 Movie Trippin' with the Kandasamys Jayan Moodley Jailoshini Naidoo, Maeshni Naicker, Madhushan ... South Africa 4-Jun-21 2021 TV-14 94 min Comedies, International Movies, Romantic Movies To rekindle their marriages, best friends-turn...
In [188]:
netflix.nunique()
Out[188]:
Unnamed: 0      8807
show_id         8807
type               2
title           8804
director        4528
cast            7692
country          746
date_added      1767
release_year      74
rating            17
duration         220
listed_in        514
description     8775
dtype: int64

The table above gives the number of unique values in each column.

In [189]:
netflix.corr()
Out[189]:
Unnamed: 0 release_year
Unnamed: 0 1.000000 -0.246713
release_year -0.246713 1.000000
In [190]:
netflix.tail()
Out[190]:
Unnamed: 0 show_id type title director cast country date_added release_year rating duration listed_in description
8802 8802 s8803 Movie Zodiac David Fincher Mark Ruffalo, Jake Gyllenhaal, Robert Downey J... United States 20-Nov-19 2007 R 158 min Cult Movies, Dramas, Thrillers A political cartoonist, a crime reporter and a...
8803 8803 s8804 TV Show Zombie Dumb NaN NaN NaN 1-Jul-19 2018 TV-Y7 2 Seasons Kids' TV, Korean TV Shows, TV Comedies While living alone in a spooky town, a young g...
8804 8804 s8805 Movie Zombieland Ruben Fleischer Jesse Eisenberg, Woody Harrelson, Emma Stone, ... United States 1-Nov-19 2009 R 88 min Comedies, Horror Movies Looking to survive in a world taken over by zo...
8805 8805 s8806 Movie Zoom Peter Hewitt Tim Allen, Courteney Cox, Chevy Chase, Kate Ma... United States 11-Jan-20 2006 PG 88 min Children & Family Movies, Comedies Dragged from civilian life, a former superhero...
8806 8806 s8807 Movie Zubaan Mozez Singh Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan... India 2-Mar-19 2015 TV-14 111 min Dramas, International Movies, Music & Musicals A scrappy but poor boy worms his way into a ty...
In [191]:
netflix.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    8807 non-null   int64 
 1   show_id       8807 non-null   object
 2   type          8807 non-null   object
 3   title         8807 non-null   object
 4   director      6173 non-null   object
 5   cast          7982 non-null   object
 6   country       7976 non-null   object
 7   date_added    8797 non-null   object
 8   release_year  8807 non-null   int64 
 9   rating        8803 non-null   object
 10  duration      8804 non-null   object
 11  listed_in     8807 non-null   object
 12  description   8807 non-null   object
dtypes: int64(2), object(11)
memory usage: 894.6+ KB
In [192]:
(netflix.isnull().sum()/len(netflix))*100
Out[192]:
Unnamed: 0       0.000000
show_id          0.000000
type             0.000000
title            0.000000
director        29.908028
cast             9.367549
country          9.435676
date_added       0.113546
release_year     0.000000
rating           0.045418
duration         0.034064
listed_in        0.000000
description      0.000000
dtype: float64

About 30% of the values are missing in the director column, and about 9% each in the cast and country columns.

3.2 Pre Profiling

In [193]:
profile_before = prf.ProfileReport(netflix)
In [194]:
profile_before
Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]
Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]
Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
Out[194]:

In [195]:
profile_before.to_file(output_file="Netflix_Before_PreProcessing.html")
Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

3.3 Preprocessing

In [196]:
netflix.head()
Out[196]:
Unnamed: 0 show_id type title director cast country date_added release_year rating duration listed_in description
0 0 s1 Movie Dick Johnson Is Dead Kirsten Johnson NaN United States 25-Sep-21 2020 PG-13 90 min Documentaries As her father nears the end of his life, filmm...
1 1 s2 TV Show Blood & Water NaN Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... South Africa 24-Sep-21 2021 TV-MA 2 Seasons International TV Shows, TV Dramas, TV Mysteries After crossing paths at a party, a Cape Town t...
2 2 s3 TV Show Ganglands Julien Leclercq Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... NaN 24-Sep-21 2021 TV-MA 1 Season Crime TV Shows, International TV Shows, TV Act... To protect his family from a powerful drug lor...
3 3 s4 TV Show Jailbirds New Orleans NaN NaN NaN 24-Sep-21 2021 TV-MA 1 Season Docuseries, Reality TV Feuds, flirtations and toilet talk go down amo...
4 4 s5 TV Show Kota Factory NaN Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... India 24-Sep-21 2021 TV-MA 2 Seasons International TV Shows, Romantic TV Shows, TV ... In a city of coaching centers known to train I...

Some columns (director, cast, listed_in, country) hold several comma-separated values per title. We will unnest them into one value per row and prepare the final dataset.

In [197]:
## Unnesting director column

dir_constraint=netflix['director'].apply(lambda x: str(x).split(', ')).tolist()
df1 = pd.DataFrame(dir_constraint, index = netflix['title']) 
df1 = df1.stack()
df1 = pd.DataFrame(df1.reset_index())
df1.rename(columns={0:'Directors'},inplace=True)
df1 = df1.drop(['level_1'],axis=1)
df1.head(10)
Out[197]:
title Directors
0 Dick Johnson Is Dead Kirsten Johnson
1 Blood & Water nan
2 Ganglands Julien Leclercq
3 Jailbirds New Orleans nan
4 Kota Factory nan
5 Midnight Mass Mike Flanagan
6 My Little Pony: A New Generation Robert Cullen
7 My Little Pony: A New Generation José Luis Ucha
8 Sankofa Haile Gerima
9 The Great British Baking Show Andy Devonshire
In [198]:
## Unnesting - cast column

cast_constraint=netflix['cast'].apply(lambda x: str(x).split(', ')).tolist()
df2 = pd.DataFrame(cast_constraint, index = netflix['title']) 
df2 = df2.stack()
df2 = pd.DataFrame(df2.reset_index())
df2.rename(columns={0:'Actors'},inplace=True)
df2 = df2.drop(['level_1'],axis=1)
df2.head(10)
Out[198]:
title Actors
0 Dick Johnson Is Dead nan
1 Blood & Water Ama Qamata
2 Blood & Water Khosi Ngema
3 Blood & Water Gail Mabalane
4 Blood & Water Thabang Molaba
5 Blood & Water Dillon Windvogel
6 Blood & Water Natasha Thahane
7 Blood & Water Arno Greeff
8 Blood & Water Xolile Tshabalala
9 Blood & Water Getmore Sithole
In [199]:
netflix.head()
Out[199]:
Unnamed: 0 show_id type title director cast country date_added release_year rating duration listed_in description
0 0 s1 Movie Dick Johnson Is Dead Kirsten Johnson NaN United States 25-Sep-21 2020 PG-13 90 min Documentaries As her father nears the end of his life, filmm...
1 1 s2 TV Show Blood & Water NaN Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... South Africa 24-Sep-21 2021 TV-MA 2 Seasons International TV Shows, TV Dramas, TV Mysteries After crossing paths at a party, a Cape Town t...
2 2 s3 TV Show Ganglands Julien Leclercq Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... NaN 24-Sep-21 2021 TV-MA 1 Season Crime TV Shows, International TV Shows, TV Act... To protect his family from a powerful drug lor...
3 3 s4 TV Show Jailbirds New Orleans NaN NaN NaN 24-Sep-21 2021 TV-MA 1 Season Docuseries, Reality TV Feuds, flirtations and toilet talk go down amo...
4 4 s5 TV Show Kota Factory NaN Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... India 24-Sep-21 2021 TV-MA 2 Seasons International TV Shows, Romantic TV Shows, TV ... In a city of coaching centers known to train I...
In [200]:
## Unnesting - listed_in column

listed_constraint=netflix['listed_in'].apply(lambda x: str(x).split(', ')).tolist()
df3 = pd.DataFrame(listed_constraint, index = netflix['title']) 
df3 = df3.stack()
df3 = pd.DataFrame(df3.reset_index())
df3.rename(columns={0:'Genre'},inplace=True)
df3 = df3.drop(['level_1'],axis=1)
df3.head(10)
Out[200]:
title Genre
0 Dick Johnson Is Dead Documentaries
1 Blood & Water International TV Shows
2 Blood & Water TV Dramas
3 Blood & Water TV Mysteries
4 Ganglands Crime TV Shows
5 Ganglands International TV Shows
6 Ganglands TV Action & Adventure
7 Jailbirds New Orleans Docuseries
8 Jailbirds New Orleans Reality TV
9 Kota Factory International TV Shows
In [201]:
## Unnesting - country column

country_constraint=netflix['country'].apply(lambda x: str(x).split(', ')).tolist()
df4 = pd.DataFrame(country_constraint, index = netflix['title']) 
df4 = df4.stack()
df4 = pd.DataFrame(df4.reset_index())
df4.rename(columns={0:'Country'},inplace=True)
df4 = df4.drop(['level_1'],axis=1)
df4.head(10)
Out[201]:
title Country
0 Dick Johnson Is Dead United States
1 Blood & Water South Africa
2 Ganglands nan
3 Jailbirds New Orleans nan
4 Kota Factory India
5 Midnight Mass nan
6 My Little Pony: A New Generation nan
7 Sankofa United States
8 Sankofa Ghana
9 Sankofa Burkina Faso
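
The four unnesting cells above repeat the same split/stack pattern. As a minimal sketch, the same reshaping can be written once with `Series.str.split` and `DataFrame.explode`. The helper name `unnest` and the toy frame are assumptions made for illustration and are not part of the original notebook. Note one difference: this version keeps missing values as NaN instead of turning them into the string 'nan'.

```python
import pandas as pd

# Toy stand-in for the Netflix frame; the rows are illustrative only.
toy = pd.DataFrame({
    "title": ["Sankofa", "Ganglands"],
    "country": ["United States, Ghana, Burkina Faso", None],
})

def unnest(df, column, new_name):
    """Return one row per (title, item) for a comma-separated column."""
    out = df[["title", column]].copy()
    # Split each string into a list; missing values stay missing.
    out[column] = out[column].str.split(", ")
    # explode() turns every list element into its own row.
    out = out.explode(column).rename(columns={column: new_name})
    return out.reset_index(drop=True)

countries = unnest(toy, "country", "Country")
print(countries)
```

Calling the helper once per nested column would give the equivalents of df1 to df4.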

Collate all the unnested dataframes

In [202]:
df5 = df2.merge(df1,on=['title'],how='inner')

df6 = df5.merge(df3,on=['title'],how='inner')

df7 = df6.merge(df4,on=['title'],how='inner')
In [203]:
df7.head()
Out[203]:
title Actors Directors Genre Country
0 Dick Johnson Is Dead nan Kirsten Johnson Documentaries United States
1 Blood & Water Ama Qamata nan International TV Shows South Africa
2 Blood & Water Ama Qamata nan TV Dramas South Africa
3 Blood & Water Ama Qamata nan TV Mysteries South Africa
4 Blood & Water Khosi Ngema nan International TV Shows South Africa
In [204]:
df7
Out[204]:
title Actors Directors Genre Country
0 Dick Johnson Is Dead nan Kirsten Johnson Documentaries United States
1 Blood & Water Ama Qamata nan International TV Shows South Africa
2 Blood & Water Ama Qamata nan TV Dramas South Africa
3 Blood & Water Ama Qamata nan TV Mysteries South Africa
4 Blood & Water Khosi Ngema nan International TV Shows South Africa
... ... ... ... ... ...
203158 Zubaan Anita Shabdish Mozez Singh International Movies India
203159 Zubaan Anita Shabdish Mozez Singh Music & Musicals India
203160 Zubaan Chittaranjan Tripathy Mozez Singh Dramas India
203161 Zubaan Chittaranjan Tripathy Mozez Singh International Movies India
203162 Zubaan Chittaranjan Tripathy Mozez Singh Music & Musicals India

203163 rows × 5 columns

In [205]:
df7.shape
Out[205]:
(203163, 5)

Merging the unnested data with the remaining columns of the original dataframe

In [206]:
netflix = df7.merge(netflix[['show_id', 'type', 'title', 'date_added',
       'release_year', 'rating', 'duration']],on=['title'],how='left')
netflix.head()
Out[206]:
title Actors Directors Genre Country show_id type date_added release_year rating duration
0 Dick Johnson Is Dead nan Kirsten Johnson Documentaries United States s1 Movie 25-Sep-21 2020 PG-13 90 min
1 Blood & Water Ama Qamata nan International TV Shows South Africa s2 TV Show 24-Sep-21 2021 TV-MA 2 Seasons
2 Blood & Water Ama Qamata nan TV Dramas South Africa s2 TV Show 24-Sep-21 2021 TV-MA 2 Seasons
3 Blood & Water Ama Qamata nan TV Mysteries South Africa s2 TV Show 24-Sep-21 2021 TV-MA 2 Seasons
4 Blood & Water Khosi Ngema nan International TV Shows South Africa s2 TV Show 24-Sep-21 2021 TV-MA 2 Seasons
In [207]:
netflix.shape
Out[207]:
(204539, 11)

The final dataset has 204,539 rows (about 2 lakh) and 11 columns.
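
The merge above grows the table from 203163 to 204539 rows. A likely cause is that title is not a unique key (8804 unique titles for 8807 shows), so rows of titles that occur more than once are multiplied. The sketch below reproduces the effect on made-up data and shows two ways to detect it; the toy frames and variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy data: "Beta" is used by two different shows.
raw = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["Alpha", "Beta", "Beta"],
    "type": ["Movie", "Movie", "TV Show"],
})
unnested = pd.DataFrame({
    "title": ["Alpha", "Beta", "Beta"],
    "Genre": ["Dramas", "Comedies", "TV Dramas"],
})

# Each "Beta" row on the left matches both "Beta" rows on the right.
merged = unnested.merge(raw, on="title", how="left")
print(len(unnested), "->", len(merged))

# Check 1: list the titles that are not unique in the raw data.
dup_titles = raw.loc[raw["title"].duplicated(keep=False), "title"].unique()

# Check 2: let merge() verify that the right-hand key is unique.
try:
    unnested.merge(raw, on="title", how="left", validate="many_to_one")
    key_is_unique = True
except pd.errors.MergeError:
    key_is_unique = False
```

Using show_id, which is unique per show, as the index and merge key during unnesting would avoid the duplication.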

In [208]:
netflix.isna().sum()
Out[208]:
title             0
Actors            0
Directors         0
Genre             0
Country           0
show_id           0
type              0
date_added      158
release_year      0
rating           67
duration          3
dtype: int64

date_added, rating and duration have some missing values; these are treated below.

In [209]:
total_null = netflix.isnull().sum().sort_values(ascending = False)
percent = ((netflix.isnull().sum()/netflix.isnull().count())*100).sort_values(ascending = False)
print("Total records = ", netflix.shape[0])

missing_data = pd.concat([total_null,percent.round(2)],axis=1,keys=['Total Missing','In Percent'])
missing_data.head(10)
Total records =  204539
Out[209]:
Total Missing In Percent
date_added 158 0.08
rating 67 0.03
duration 3 0.00
title 0 0.00
Actors 0 0.00
Directors 0 0.00
Genre 0 0.00
Country 0 0.00
show_id 0 0.00
type 0 0.00

The table above summarizes the missing values as absolute counts and as percentages. date_added has the most missing values.
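
The same summary is computed again later in the notebook, so it may be convenient to wrap it in a small function. This is a sketch only: the function name `missing_summary` and the toy frame are assumptions, and `isna().mean()` is used as a shorter way to get the share of missing values per column.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values per column, largest first."""
    total = df.isna().sum()
    # The mean of a boolean mask is the fraction of True values.
    percent = (df.isna().mean() * 100).round(2)
    summary = pd.DataFrame({"Total Missing": total, "In Percent": percent})
    return summary.sort_values("Total Missing", ascending=False)

# Illustrative data, not taken from the Netflix file.
toy = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
summary = missing_summary(toy)
print(summary)
```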

Treating Missing values

In [210]:
import numpy as np
In [211]:
## Some columns contain the string 'nan' (created by str() during unnesting), which marks a missing value and has to be replaced

netflix['Actors'] = netflix['Actors'].replace('nan', 'Unknown Actor')
netflix['Directors'] = netflix['Directors'].replace('nan', 'Unknown Director')
netflix['Country'] = netflix['Country'].replace('nan', np.nan)
netflix.head()
Out[211]:
title Actors Directors Genre Country show_id type date_added release_year rating duration
0 Dick Johnson Is Dead Unknown Actor Kirsten Johnson Documentaries United States s1 Movie 25-Sep-21 2020 PG-13 90 min
1 Blood & Water Ama Qamata Unknown Director International TV Shows South Africa s2 TV Show 24-Sep-21 2021 TV-MA 2 Seasons
2 Blood & Water Ama Qamata Unknown Director TV Dramas South Africa s2 TV Show 24-Sep-21 2021 TV-MA 2 Seasons
3 Blood & Water Ama Qamata Unknown Director TV Mysteries South Africa s2 TV Show 24-Sep-21 2021 TV-MA 2 Seasons
4 Blood & Water Khosi Ngema Unknown Director International TV Shows South Africa s2 TV Show 24-Sep-21 2021 TV-MA 2 Seasons
In [212]:
total_null = netflix.isnull().sum().sort_values(ascending = False)
percent = ((netflix.isnull().sum()/netflix.isnull().count())*100).sort_values(ascending = False)
print("Total records = ", netflix.shape[0])

missing_data = pd.concat([total_null,percent.round(2)],axis=1,keys=['Total Missing','In Percent'])
missing_data.head(10)
Total records =  204539
Out[212]:
Total Missing In Percent
Country 12497 6.11
date_added 158 0.08
rating 67 0.03
duration 3 0.00
title 0 0.00
Actors 0 0.00
Directors 0 0.00
Genre 0 0.00
show_id 0 0.00
type 0 0.00

After replacing the string 'nan' with np.nan, the actual share of null values in Country rises to 6.11% (12,497 rows).

In [213]:
netflix[netflix['duration'].isnull()]
Out[213]:
title Actors Directors Genre Country show_id type date_added release_year rating duration
129171 Louis C.K. 2017 Louis C.K. Louis C.K. Movies United States s5542 Movie 4-Apr-17 2017 74 min NaN
134237 Louis C.K.: Hilarious Louis C.K. Louis C.K. Movies United States s5795 Movie 16-Sep-16 2010 84 min NaN
134371 Louis C.K.: Live at the Comedy Store Louis C.K. Louis C.K. Movies United States s5814 Movie 15-Aug-16 2015 66 min NaN

In these three rows the runtime (e.g. "74 min") was entered in the rating column and duration was left empty. We copy the rating value into the missing duration and then set the rating of these rows to 'NR'.

In [214]:
# Rows with a missing duration hold the runtime in the rating column
swapped = netflix['duration'].isnull()
netflix.loc[swapped, 'duration'] = netflix.loc[swapped, 'rating']
# Any rating that still contains "min" is a misplaced runtime, not a rating
netflix.loc[netflix['rating'].str.contains('min', na=False), 'rating'] = 'NR'
netflix['rating'] = netflix['rating'].fillna('NR')
netflix.isnull().sum()
Out[214]:
title               0
Actors              0
Directors           0
Genre               0
Country         12497
show_id             0
type                0
date_added        158
release_year        0
rating              0
duration            0
dtype: int64

Filling missing values of date_added with the most frequent (mode) date_added among titles with the same release year

In [215]:
for i in netflix[netflix['date_added'].isnull()]['release_year'].unique():
    date = netflix[netflix['release_year'] == i]['date_added'].mode().values[0]
    netflix.loc[netflix['release_year'] == i,'date_added'] = netflix.loc[netflix['release_year']==i,'date_added'].fillna(date)
In [216]:
netflix.isnull().sum()
Out[216]:
title               0
Actors              0
Directors           0
Genre               0
Country         12497
show_id             0
type                0
date_added          0
release_year        0
rating              0
duration            0
dtype: int64
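
The loop above can also be expressed as a group-wise transform. The following is a minimal sketch on made-up data, assuming the same rule (fill with the most frequent date_added within each release year); the helper `first_mode` is a hypothetical name introduced here.

```python
import numpy as np
import pandas as pd

# Illustrative rows only.
toy = pd.DataFrame({
    "release_year": [2019, 2019, 2019, 2020, 2020],
    "date_added": ["1-Jan-20", "1-Jan-20", None, "5-Mar-21", None],
})

def first_mode(values):
    """Most frequent non-missing value, or NaN if the group has none."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else np.nan

# transform() broadcasts each group's mode back to the rows of that group.
group_mode = toy.groupby("release_year")["date_added"].transform(first_mode)
toy["date_added"] = toy["date_added"].fillna(group_mode)
print(toy)
```

Unlike `.mode().values[0]`, the helper does not fail when a release year has no known date_added at all.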

Filling missing values of Country with the most frequent (mode) country of the same director, where that director has at least one title with a known country

In [217]:
for i in netflix[netflix['Country'].isnull()]['Directors'].unique():
    if i in netflix[~netflix['Country'].isnull()]['Directors'].unique():
        country = netflix[netflix['Directors'] == i]['Country'].mode().values[0]
        netflix.loc[netflix['Directors'] == i,'Country'] = netflix.loc[netflix['Directors'] == i,'Country'].fillna(country)
In [218]:
netflix.isnull().sum()
Out[218]:
title              0
Actors             0
Directors          0
Genre              0
Country         4276
show_id            0
type               0
date_added         0
release_year       0
rating             0
duration           0
dtype: int64

The remaining missing values are filled in the same way using the Actors column.

In [219]:
for i in netflix[netflix['Country'].isnull()]['Actors'].unique():
    if i in netflix[~netflix['Country'].isnull()]['Actors'].unique():
        imp = netflix[netflix['Actors'] == i]['Country'].mode().values[0]
        netflix.loc[netflix['Actors'] == i,'Country'] = netflix.loc[netflix['Actors']==i,'Country'].fillna(imp)
In [220]:
netflix.isnull().sum()
Out[220]:
title              0
Actors             0
Directors          0
Genre              0
Country         2069
show_id            0
type               0
date_added         0
release_year       0
rating             0
duration           0
dtype: int64
In [221]:
netflix['Country'] = netflix['Country'].fillna('Unknown Country')
netflix.isnull().sum()
Out[221]:
title           0
Actors          0
Directors       0
Genre           0
Country         0
show_id         0
type            0
date_added      0
release_year    0
rating          0
duration        0
dtype: int64

Missing-value handling is now complete. Next, we convert data types and then move on to the analysis.

In [222]:
netflix.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 204539 entries, 0 to 204538
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   title         204539 non-null  object
 1   Actors        204539 non-null  object
 2   Directors     204539 non-null  object
 3   Genre         204539 non-null  object
 4   Country       204539 non-null  object
 5   show_id       204539 non-null  object
 6   type          204539 non-null  object
 7   date_added    204539 non-null  object
 8   release_year  204539 non-null  int64 
 9   rating        204539 non-null  object
 10  duration      204539 non-null  object
dtypes: int64(1), object(10)
memory usage: 26.8+ MB
In [223]:
# Convert date_added from object (string) to datetime so that year and month can be extracted

netflix["date_added"] = pd.to_datetime(netflix['date_added'])
In [224]:
netflix.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 204539 entries, 0 to 204538
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   title         204539 non-null  object        
 1   Actors        204539 non-null  object        
 2   Directors     204539 non-null  object        
 3   Genre         204539 non-null  object        
 4   Country       204539 non-null  object        
 5   show_id       204539 non-null  object        
 6   type          204539 non-null  object        
 7   date_added    204539 non-null  datetime64[ns]
 8   release_year  204539 non-null  int64         
 9   rating        204539 non-null  object        
 10  duration      204539 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(9)
memory usage: 26.8+ MB
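
Once date_added is a datetime column, year and month are available through the `.dt` accessor. The sketch below uses three date strings in the day-month-year style seen in the table above and parses them with an explicit format. The format string `%d-%b-%y` and the new column names are assumptions; they would only apply to the full file if every value follows this pattern.

```python
import pandas as pd

# Sample strings in the same style as the date_added column.
dates = pd.Series(["25-Sep-21", "1-Jul-19", "11-Jan-20"])

# An explicit format avoids ambiguity between day and two-digit year.
parsed = pd.to_datetime(dates, format="%d-%b-%y")

parts = pd.DataFrame({
    "year_added": parsed.dt.year,
    "month_added": parsed.dt.month,
    "month_name": parsed.dt.month_name(),
})
print(parts)
```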
In [225]:
## Removing the " min" suffix in the duration column

netflix['duration'] = netflix['duration'].str.replace(" min", "")
netflix.head(6)
Out[225]:
title Actors Directors Genre Country show_id type date_added release_year rating duration
0 Dick Johnson Is Dead Unknown Actor Kirsten Johnson Documentaries United States s1 Movie 2021-09-25 2020 PG-13 90
1 Blood & Water Ama Qamata Unknown Director International TV Shows South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons
2 Blood & Water Ama Qamata Unknown Director TV Dramas South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons
3 Blood & Water Ama Qamata Unknown Director TV Mysteries South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons
4 Blood & Water Khosi Ngema Unknown Director International TV Shows South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons
5 Blood & Water Khosi Ngema Unknown Director TV Dramas South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons
In [226]:
netflix['duration2'] = netflix.duration.copy()
netflix_ = netflix.copy()
In [227]:
netflix_.loc[netflix_['duration2'].str.contains('Season'),'duration2'] = 0
netflix_['duration2'] = netflix_.duration2.astype('int')
netflix_.head()
Out[227]:
title Actors Directors Genre Country show_id type date_added release_year rating duration duration2
0 Dick Johnson Is Dead Unknown Actor Kirsten Johnson Documentaries United States s1 Movie 2021-09-25 2020 PG-13 90 90
1 Blood & Water Ama Qamata Unknown Director International TV Shows South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 0
2 Blood & Water Ama Qamata Unknown Director TV Dramas South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 0
3 Blood & Water Ama Qamata Unknown Director TV Mysteries South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 0
4 Blood & Water Khosi Ngema Unknown Director International TV Shows South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 0
In [230]:
netflix_.duration2.describe()
Out[230]:
count    204539.000000
mean         77.503151
std          52.443402
min           0.000000
25%           0.000000
50%          95.000000
75%         112.000000
max         312.000000
Name: duration2, dtype: float64
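
Because TV shows are coded as 0 minutes, the summary above mixes two different units: the 25% quantile is 0 and the mean is pulled down to about 77. One possible alternative, sketched here on made-up rows, is to keep movie minutes and TV seasons in separate numeric columns. The column names `minutes` and `seasons` are assumptions for illustration.

```python
import pandas as pd

# Illustrative rows only.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "duration": ["90 min", "2 Seasons", "158 min"],
})

# Pull the leading number out of strings such as "90 min" or "2 Seasons".
number = pd.to_numeric(toy["duration"].str.extract(r"(\d+)", expand=False))

is_movie = toy["type"].eq("Movie")
toy["minutes"] = number.where(is_movie)    # NaN for TV shows
toy["seasons"] = number.where(~is_movie)   # NaN for movies

# NaN values are skipped, so this is the mean over movies only.
print(toy["minutes"].mean())
```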
In [231]:
netflix_.nunique()
Out[231]:
title            8804
Actors          36440
Directors        4994
Genre              42
Country           127
show_id          8807
type                2
date_added       1714
release_year       74
rating             14
duration          220
duration2         206
dtype: int64

Actors has the most unique values, followed by show_id, title and Directors.

3.4 Post Profiling¶

In [232]:
profile_clean = prf.ProfileReport(netflix)
profile_clean
Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]
Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]
Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
Out[232]:

In [233]:
profile_clean.to_file(output_file='Netflix_After_PreProcessing.html')
Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

EDA¶

  • What different types of shows or movies are uploaded on Netflix?
  • Correlation between the features
  • Most watched shows on Netflix
  • Distribution of ratings
  • Which has the highest rating: TV shows or movies?
  • Finding the best month for releasing content
  • Most watched genres on Netflix
  • Movies released over the years

What different types of show or movie are uploaded on Netflix?

In [234]:
##method1
netflix.groupby('type')['title'].count().sort_values(ascending=False)
Out[234]:
type
Movie      147799
TV Show     56740
Name: title, dtype: int64
In [235]:
netflix['type'].value_counts().to_frame('values_count')
Out[235]:
values_count
Movie 147799
TV Show 56740
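
Both results above count rows of the unnested table, so a title with many actors, genres or countries is counted many times. For comparison, the raw data summary earlier in the notebook shows 6131 movies among 8807 titles. A minimal sketch for counting distinct titles per type, on made-up rows with assumed values:

```python
import pandas as pd

# Toy unnested table: s2 and s3 are repeated once per actor.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s2", "s2", "s3", "s3"],
    "type": ["Movie", "TV Show", "TV Show", "TV Show", "Movie", "Movie"],
    "Actors": ["A", "B", "C", "D", "E", "F"],
})

# Counts rows, which is what value_counts() on the unnested table reports.
row_counts = toy["type"].value_counts()

# Counts each show once, regardless of how many rows it was expanded into.
title_counts = toy.groupby("type")["show_id"].nunique()

print(row_counts)
print(title_counts)
```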
In [236]:
netflix.groupby(["type","release_year"])["title"].agg(pd.Series.mode)
Out[236]:
type     release_year
Movie    1942                                         The Battle of Midway
         1943            [Undercover: How to Operate Behind Enemy Lines...
         1944                                             Tunisian Victory
         1945                                      Know Your Enemy - Japan
         1946                                           Let There Be Light
                                               ...                        
TV Show  2017                                                       Narcos
         2018                                                        9-Feb
         2019                                                  Creeped Out
         2020                                                     The Eddy
         2021                                                     Navarasa
Name: title, Length: 119, dtype: object

Univariate analysis of duration column

In [238]:
## Histogram to see the distribution of duration

plt.style.use('dark_background')
# displot is figure-level and creates its own figure, so size it via height/aspect
sns.displot(netflix_['duration2'], height=2, aspect=5)
Out[238]:
<seaborn.axisgrid.FacetGrid at 0x2482a8a59d0>

Most of the values are around 100 minutes; the values at 0 are the TV shows, whose duration is given in seasons rather than minutes.

In [239]:
bins = [-1,1,50,80,100,120,150,200,315]
labels = ['<1','1-50','50-80','80-100','100-120','120-150','150-200','200-315']
netflix_['duration2'] = pd.cut(netflix_['duration2'],bins = bins, labels = labels )
netflix_.head()
Out[239]:
title Actors Directors Genre Country show_id type date_added release_year rating duration duration2
0 Dick Johnson Is Dead Unknown Actor Kirsten Johnson Documentaries United States s1 Movie 2021-09-25 2020 PG-13 90 80-100
1 Blood & Water Ama Qamata Unknown Director International TV Shows South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons <1
2 Blood & Water Ama Qamata Unknown Director TV Dramas South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons <1
3 Blood & Water Ama Qamata Unknown Director TV Mysteries South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons <1
4 Blood & Water Khosi Ngema Unknown Director International TV Shows South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons <1
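
The binning step above can be checked on a few values. This is a minimal sketch that reuses the same bin edges and labels; the sample durations are made up for illustration. It shows that `pd.cut` uses right-closed intervals by default, so a 100-minute movie lands in '80-100' and not in '100-120'.

```python
import pandas as pd

# Same bin edges and labels as the cell above
bins = [-1, 1, 50, 80, 100, 120, 150, 200, 315]
labels = ['<1', '1-50', '50-80', '80-100', '100-120', '120-150', '150-200', '200-315']

# Hypothetical durations in minutes (0 stands for a TV show)
sample = pd.Series([0, 45, 90, 100, 101, 312])

# Intervals are right-closed by default: (80, 100] gets the label '80-100'
binned = pd.cut(sample, bins=bins, labels=labels)
print(binned.tolist())
```

Because the edges are right-closed, a label such as '80-100' really means "more than 80 and up to 100 minutes".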
In [240]:
# For movies (no 'Season' in duration), replace the raw minutes with the binned category
is_movie = ~netflix_['duration'].str.contains('Season')
netflix_.loc[is_movie, 'duration'] = netflix_.loc[is_movie, 'duration2']
netflix_.drop(columns=['duration2'], inplace=True)
netflix_.head()
Out[240]:
title Actors Directors Genre Country show_id type date_added release_year rating duration
0 Dick Johnson Is Dead Unknown Actor Kirsten Johnson Documentaries United States s1 Movie 2021-09-25 2020 PG-13 80-100
1 Blood & Water Ama Qamata Unknown Director International TV Shows South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons
2 Blood & Water Ama Qamata Unknown Director TV Dramas South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons
3 Blood & Water Ama Qamata Unknown Director TV Mysteries South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons
4 Blood & Water Khosi Ngema Unknown Director International TV Shows South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons

Extracting the year, month, day and weekday from the date_added column makes it possible to check, for example, which month had the most TV shows added.

In [242]:
# Derive calendar parts from date_added with the pandas .dt accessor
netflix_["year_added"] = netflix_['date_added'].dt.year.astype("Int64")
netflix_["month_added"] = netflix_['date_added'].dt.month.astype("Int64")
netflix_['month_name'] = netflix_['date_added'].dt.month_name()
netflix_["day_added"] = netflix_['date_added'].dt.day.astype("Int64")
netflix_['Weekday_added'] = netflix_['date_added'].dt.day_name()
netflix_.head()
Out[242]:
title Actors Directors Genre Country show_id type date_added release_year rating duration year_added month_added month_name day_added Weekday_added
0 Dick Johnson Is Dead Unknown Actor Kirsten Johnson Documentaries United States s1 Movie 2021-09-25 2020 PG-13 80-100 2021 9 September 25 Saturday
1 Blood & Water Ama Qamata Unknown Director International TV Shows South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 2021 9 September 24 Friday
2 Blood & Water Ama Qamata Unknown Director TV Dramas South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 2021 9 September 24 Friday
3 Blood & Water Ama Qamata Unknown Director TV Mysteries South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 2021 9 September 24 Friday
4 Blood & Water Khosi Ngema Unknown Director International TV Shows South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 2021 9 September 24 Friday
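
The date-part extraction can be verified on a small frame. The sketch below is illustrative only: `toy` is a hypothetical frame that borrows the two dates visible in the output above and adds one missing value, to show that the `.dt` accessor propagates missing dates instead of failing.

```python
import pandas as pd

# Hypothetical frame: two dates from the output above plus one missing value
toy = pd.DataFrame({"date_added": pd.to_datetime(["2021-09-25", "2021-09-24", None])})

# Nullable Int64 keeps the year as an integer even when a date is missing
toy["year_added"] = toy["date_added"].dt.year.astype("Int64")
toy["month_name"] = toy["date_added"].dt.month_name()
toy["Weekday_added"] = toy["date_added"].dt.day_name()
print(toy)
```

The first two rows should agree with the September / Saturday / Friday values shown in the notebook output, and the third row stays missing in every derived column.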
In [244]:
# Remove parenthesised text from titles; regex=True is explicit to avoid the FutureWarning
netflix_['title'] = netflix_['title'].str.replace(r"\(.*\)", "", regex=True)
netflix_.head()
Out[244]:
title Actors Directors Genre Country show_id type date_added release_year rating duration year_added month_added month_name day_added Weekday_added
0 Dick Johnson Is Dead Unknown Actor Kirsten Johnson Documentaries United States s1 Movie 2021-09-25 2020 PG-13 80-100 2021 9 September 25 Saturday
1 Blood & Water Ama Qamata Unknown Director International TV Shows South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 2021 9 September 24 Friday
2 Blood & Water Ama Qamata Unknown Director TV Dramas South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 2021 9 September 24 Friday
3 Blood & Water Ama Qamata Unknown Director TV Mysteries South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 2021 9 September 24 Friday
4 Blood & Water Khosi Ngema Unknown Director International TV Shows South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 2021 9 September 24 Friday
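
One thing to keep in mind about the pattern `\(.*\)` used above is that `.*` is greedy: if a title contains two parenthesised parts, everything from the first "(" to the last ")" is removed. A short sketch with made-up titles, comparing it with the non-greedy variant `\(.*?\)`:

```python
import re

# Hypothetical titles, not taken from the dataset
titles = ["Some Title (Extended Cut)", "First (Part 1) and Second (Part 2)"]

# Greedy: matches from the first "(" to the last ")"
greedy = [re.sub(r"\(.*\)", "", t).strip() for t in titles]

# Non-greedy: removes each parenthesised group separately
lazy = [re.sub(r"\(.*?\)", "", t).strip() for t in titles]

print(greedy)
print(lazy)
```

Whether the greedy or the non-greedy behaviour is preferable depends on how the titles are actually written; the notebook uses the greedy form.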
In [314]:
netflix_genre=netflix_.groupby(['Genre']).agg({"title":"nunique"}).reset_index().sort_values(by=['title'],ascending=False)[:15]
plt.figure(figsize=(15,6))
sns.barplot(x = "Genre",y = 'title', data = netflix_genre)
plt.xticks(rotation = 60)
plt.title('Top 15 Genres')
plt.show()

International Movies, Dramas and Comedies are the genres with the most titles.

In [257]:
netflix_pie = netflix_.groupby(['type']).agg({'title':'nunique'}).reset_index()
In [258]:
netflix_pie
Out[258]:
type title
0 Movie 6113
1 TV Show 2675
In [259]:
colors = sns.color_palette('bright')[0:5]
plt.figure(figsize=(10,4))

plt.pie(netflix_pie['title'], labels = netflix_pie['type'], colors = colors, autopct='%.0f%%')
plt.title('Percentage of movies and TV shows')
plt.show()

The data has a roughly 70:30 ratio of movies to TV shows (6,113 movies and 2,675 TV shows by unique title).
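
The counts by unique title differ from the row counts shown earlier (147,799 and 56,740) because the frame holds one row per combination of title, actor, director, genre and country. A minimal sketch of the difference, using a hypothetical long-format frame `toy`:

```python
import pandas as pd

# Hypothetical long-format frame: one row per (title, actor) pair
toy = pd.DataFrame({
    "title":  ["A", "A", "A", "B", "C", "C"],
    "type":   ["Movie", "Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "Actors": ["x", "y", "z", "x", "w", "v"],
})

# Counts rows, so titles with many actors are counted many times
row_counts = toy["type"].value_counts()

# Counts each title once per type
title_counts = toy.groupby("type")["title"].nunique()

print(row_counts)
print(title_counts)
```

For questions about how much content exists, the `nunique` version is the one to read; the row counts mainly reflect how many cast and genre entries each title has.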

In [260]:
netflix_['Country'] = netflix_['Country'].str.replace(',', '')
netflix_.head()
Out[260]:
title Actors Directors Genre Country show_id type date_added release_year rating duration year_added month_added month_name day_added Weekday_added
0 Dick Johnson Is Dead Unknown Actor Kirsten Johnson Documentaries United States s1 Movie 2021-09-25 2020 PG-13 80-100 2021 9 September 25 Saturday
1 Blood & Water Ama Qamata Unknown Director International TV Shows South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 2021 9 September 24 Friday
2 Blood & Water Ama Qamata Unknown Director TV Dramas South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 2021 9 September 24 Friday
3 Blood & Water Ama Qamata Unknown Director TV Mysteries South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 2021 9 September 24 Friday
4 Blood & Water Khosi Ngema Unknown Director International TV Shows South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 2021 9 September 24 Friday
In [261]:
netflix_country = netflix_.groupby(['Country']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[:10]
plt.figure(figsize=(15,6))
sns.barplot(y = "Country",x = 'title', data = netflix_country)
plt.xticks(rotation = 60)
plt.title('Top 10 Countries for content creation')
plt.show()

The US, India, the UK, Canada and France are the leading countries in content creation on Netflix.

In [262]:
netflix_rating = netflix_.groupby(['rating']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[:10]

plt.figure(figsize=(15,6))
sns.barplot(y = "rating",x = 'title', data = netflix_rating)
plt.xticks(rotation = 60)
plt.title('Top 10 rating types')
plt.show()

The most common rating category on Netflix is the one intended for mature audiences. Note that rating here is the maturity classification of a title, not a quality score.

In [263]:
netflix_duration = netflix_.groupby(['duration']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[:10]

plt.figure(figsize=(15,6))
sns.barplot(y = "duration",x = 'title', data = netflix_duration)
plt.xticks(rotation = 60)
plt.title('Top 10 duration categories')
plt.show()

The most common duration category in the whole data is 80-100 minutes. The leading categories must be movies, along with shows that have only 1 season.

In [269]:
# Drop the placeholder before ranking so that the top 15 contains real actors only
netflix_actors = netflix_[netflix_['Actors']!='Unknown Actor']
netflix_actors = netflix_actors.groupby(['Actors']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[:15]
plt.figure(figsize=(15,6))
sns.barplot(y = "Actors",x = 'title', data = netflix_actors )
plt.xticks(rotation = 60)
plt.title('Top 15 most popular Actors')
plt.show()

Anupam Kher, Shah Rukh Khan (SRK), Julie Tejwani, Naseeruddin Shah and Takahiro Sakurai occupy the top spots by number of titles.

In [271]:
# Drop the placeholder before ranking so that the top 15 contains real directors only
netflix_directors = netflix_[netflix_['Directors']!='Unknown Director']
netflix_directors = netflix_directors.groupby(['Directors']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[:15]
plt.figure(figsize=(15,6))
sns.barplot(y = "Directors",x = 'title', data = netflix_directors )
plt.xticks(rotation = 60)
plt.title('Top 15 most popular Directors')
plt.show()

Rajiv Chilaka, Jan Suter and Raul Campos are the directors with the most titles on Netflix.

In [272]:
netflix_.head()
Out[272]:
title Actors Directors Genre Country show_id type date_added release_year rating duration year_added month_added month_name day_added Weekday_added
0 Dick Johnson Is Dead Unknown Actor Kirsten Johnson Documentaries United States s1 Movie 2021-09-25 2020 PG-13 80-100 2021 9 September 25 Saturday
1 Blood & Water Ama Qamata Unknown Director International TV Shows South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 2021 9 September 24 Friday
2 Blood & Water Ama Qamata Unknown Director TV Dramas South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 2021 9 September 24 Friday
3 Blood & Water Ama Qamata Unknown Director TV Mysteries South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 2021 9 September 24 Friday
4 Blood & Water Khosi Ngema Unknown Director International TV Shows South Africa s2 TV Show 2021-09-24 2021 TV-MA 2 Seasons 2021 9 September 24 Friday
In [273]:
netflix_year = netflix_.groupby(['year_added']).agg({'title':'nunique'}).reset_index()

plt.figure(figsize=(15,6))
sns.lineplot(x = "year_added",y = 'title', data = netflix_year, color = 'red' )
plt.xticks(rotation = 60)
plt.title('movies/ TV shows across the years')
plt.show()

The amount of content added to Netflix increased continuously from 2008 to 2019 and then started to decrease, probably due to Covid.

In [274]:
fig = plt.figure(figsize = (15,5))

#plt.style.use('dark_background')
sns.countplot(data = netflix_,x = 'year_added',hue = 'type',palette ="Reds_r")
plt.title('Movies and TV Shows added to Netflix by year', fontsize=14)
Out[274]:
Text(0.5, 1.0, 'Movies and TV Shows added to Netflix by year')

Additions of both TV shows and movies grew over the years and started declining after 2020, possibly due to Covid. Across the years, more movies than TV shows were added.

In [282]:
netflix_month = netflix_.groupby(['month_name', 'type']).agg({'title':'nunique'}).reset_index()

plt.figure(figsize=(15,6))
# hue takes the column name; a fixed color is not needed when hue is set
sns.lineplot(x = "month_name",y = 'title', data = netflix_month, hue = 'type')
plt.xticks(rotation = 60)
plt.title('movies/ TV shows added across the months')
plt.show()

For both TV shows and movies, the best launch month is the same: July, followed by December.

In [283]:
netflix_month = netflix_.groupby(['month_name']).agg({'title':'nunique'}).reset_index()

plt.figure(figsize=(15,6))
sns.lineplot(x = "month_name",y = 'title', data = netflix_month, color = 'red' )
plt.xticks(rotation = 60)
plt.title('movies/ TV shows added across the months')
plt.show()

In general, most of the content is added in December and July.
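
Since `groupby` sorts month names alphabetically, the x-axis of the monthly plots does not follow the calendar. One possible way to get a chronological axis is to turn `month_name` into an ordered categorical before grouping. The sketch below uses a hypothetical frame `toy`; it is an illustration of the idea, not the notebook's own method.

```python
import pandas as pd

month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
               'August', 'September', 'October', 'November', 'December']

# Hypothetical data: four titles added in three different months
toy = pd.DataFrame({
    "month_name": ["July", "December", "April", "July"],
    "title": ["A", "B", "C", "D"],
})

# An ordered categorical makes groupby sort by calendar order, not alphabetically
toy["month_name"] = pd.Categorical(toy["month_name"], categories=month_order, ordered=True)

# observed=True keeps only the months that actually occur in the data
per_month = toy.groupby("month_name", observed=True)["title"].nunique().reset_index()
print(per_month)
```

Passing a frame prepared this way to `sns.lineplot` should then draw the months from January to December.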

In [284]:
netflix_day = netflix_.groupby(['day_added']).agg({'title':'nunique'}).reset_index()

plt.figure(figsize=(15,6))
sns.barplot(x = "day_added",y = 'title', data = netflix_day, color = 'red' )
plt.xticks(rotation = 60)
plt.title('movies/ TV shows added across each day')
plt.show()

The 1st of the month is clearly the day on which the most content was added.

In [285]:
netflix_weekday = netflix_.groupby(['Weekday_added', 'type']).agg({'title':'nunique'}).reset_index()

plt.figure(figsize=(15,6))
# hue takes the column name; a fixed color is not needed when hue is set
sns.lineplot(x = "Weekday_added",y = 'title', data = netflix_weekday, hue = 'type')
plt.xticks(rotation = 60)
plt.title('movies/ TV shows added across weekdays')
plt.show()
In [286]:
netflix_weekday = netflix_.groupby(['Weekday_added']).agg({'title':'nunique'}).reset_index()

plt.figure(figsize=(15,6))
sns.lineplot(x = "Weekday_added",y = 'title', data = netflix_weekday, color = 'red' )
plt.xticks(rotation = 60)
plt.title('movies/ TV shows added across weekdays')
plt.show()

For releasing content on Netflix, Friday is the best day, followed by Thursday.

In [291]:
plt.figure(figsize=(15,6))
sns.boxplot(x='type', y='release_year', data=netflix_, )
sns.despine(left=True)
plt.title('Type of Show by Release Date')
plt.ylim(2000,2020)
Out[291]:
(2000.0, 2020.0)

TV shows tend to have a more recent release_year than movies, which suggests that TV shows have been released more in recent years.
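
The boxplot is drawn on the long-format frame, so a title with many cast or genre rows weighs more than a title with few. A hedged way to cross-check the claim is to compare the median release year per type after keeping one row per title. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: title "A" appears three times
toy = pd.DataFrame({
    "title":        ["A", "A", "A", "B", "C", "D"],
    "type":         ["Movie", "Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "release_year": [2010, 2010, 2010, 2018, 2019, 2017],
})

# Median on all rows: title "A" dominates the Movie group
median_rows = toy.groupby("type")["release_year"].median()

# Median on one row per title
median_titles = toy.drop_duplicates("title").groupby("type")["release_year"].median()

print(median_rows)
print(median_titles)
```

If the per-title medians on the real data still show TV shows as more recent, the observation above holds independently of the row duplication.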

Bivariate Analysis¶

In [309]:
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 
               'October', 'November', 'December']
content = netflix_.groupby('year_added')['month_name'].value_counts().unstack().fillna(0)[month_order].T

plt.figure(figsize=(10,8))
plt.title("Content added per month and year")
sns.heatmap(content , cmap = 'Blues')
plt.show()

The largest numbers of movies and TV shows were added in November 2019 and July 2021.

Fewer movies and TV shows were added from 2008 to 2015

In [310]:
plt.figure(figsize = (12,5))
sns.scatterplot(y = netflix_.index , x = netflix_.release_year , hue = netflix_.type , palette='Set2')
Out[310]:
<AxesSubplot:xlabel='release_year'>
In [311]:
netflix_.groupby(['day_added']).agg({"title":"nunique"})
Out[311]:
title
day_added
1 2219
2 325
3 151
4 175
5 231
6 210
7 190
8 201
9 148
10 214
11 149
12 181
13 175
14 198
15 688
16 289
17 180
18 205
19 243
20 248
21 190
22 230
23 182
24 159
25 196
26 205
27 195
28 190
29 140
30 210
31 274

The table confirms that the 1st of the month is when the most content was added (2,219 titles), followed by the 15th (688 titles).
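
To put the day-of-month counts into proportion, the table can be converted into percentage shares. The sketch below copies the 31 counts from the table above into a Series; the variable names are illustrative.

```python
import pandas as pd

# Unique-title counts per day of month, copied from the table above
counts = pd.Series(
    [2219, 325, 151, 175, 231, 210, 190, 201, 148, 214,
     149, 181, 175, 198, 688, 289, 180, 205, 243, 248,
     190, 230, 182, 159, 196, 205, 195, 190, 140, 210, 274],
    index=range(1, 32), name="title",
)

# Share of each day in percent, rounded to one decimal
share = (counts / counts.sum() * 100).round(1)

# The three days with the largest share
top3 = share.sort_values(ascending=False).head(3)
print(top3)
```

On these numbers the 1st accounts for roughly a quarter of all additions, and the 15th is a distant second.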

Univariate Analysis separately for shows and movies¶

In [312]:
netflix_shows = netflix_[netflix_['type']=='TV Show']
netflix_movies = netflix_[netflix_['type']=='Movie']
In [316]:
netflix_genre = netflix_shows.groupby(['Genre']).agg({"title":"nunique"}).reset_index().sort_values(by=['title'],ascending=False)[:15]
plt.figure(figsize = (15,6))
sns.barplot(y = "Genre",x = 'title', data = netflix_genre)
plt.xticks(rotation = 60)
plt.title('Top 15 Genres')
plt.show()

International TV Shows, Dramas and Comedies are the genres with the most titles among TV shows on Netflix.

In [317]:
netflix_genre = netflix_movies.groupby(['Genre']).agg({"title":"nunique"}).reset_index().sort_values(by=['title'],ascending=False)[:15]
plt.figure(figsize = (15,6))
sns.barplot(y = "Genre",x = 'title', data = netflix_genre)
plt.xticks(rotation = 60)
plt.title('Top 15 Genres')
plt.show()

International Movies, Dramas and Comedies are the genres with the most titles among movies on Netflix, followed by Documentaries.

In [319]:
netflix_country = netflix_shows.groupby(['Country']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[:10]
plt.figure(figsize=(15,6))
sns.barplot(y = "Country",x = 'title', data = netflix_country)
plt.xticks(rotation = 60)
plt.title('Top 10 Countries for content creation')
plt.show()
In [320]:
netflix_country = netflix_movies.groupby(['Country']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[:10]
plt.figure(figsize=(15,6))
sns.barplot(y = "Country",x = 'title', data = netflix_country)
plt.xticks(rotation = 60)
plt.title('Top 10 Countries for content creation')
plt.show()

The United States leads in both TV shows and movies, and the UK also contributes a great deal of content in both categories. Surprisingly, India is much more prevalent in movies than in TV shows.

Moreover, the number of movies created in India outweighs the combined number of TV shows and movies from the UK, which is why India ranks second in total content across Netflix.
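
The comparison between countries is easier to read when both types sit side by side in one table. A minimal sketch with a hypothetical frame `toy`, assuming the same `title`, `type` and `Country` columns as in the notebook:

```python
import pandas as pd

# Hypothetical long-format data; title "A" appears twice
toy = pd.DataFrame({
    "title":   ["A", "A", "B", "C", "D", "E"],
    "type":    ["Movie", "Movie", "Movie", "TV Show", "Movie", "TV Show"],
    "Country": ["India", "India", "India", "United Kingdom", "United Kingdom", "India"],
})

# Unique titles per country and type, with 0 where a combination does not occur
by_country = toy.pivot_table(index="Country", columns="type", values="title",
                             aggfunc="nunique", fill_value=0)
print(by_country)
```

On the real data, comparing `by_country.loc["India", "Movie"]` with the row sum for the United Kingdom would test the statement above directly.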

5. Conclusions¶

Business Insights¶

  • Additions of both TV shows and movies increased until 2020 and then started declining, possibly due to Covid. Across the years, more movies than TV shows were added.

  • Most of the content is added in December and July. By weekday, Friday is the best day, followed by Thursday.

  • The 1st of the month is clearly the day on which the most content was added.

  • Anupam Kher, Shah Rukh Khan (SRK), Julie Tejwani, Naseeruddin Shah and Takahiro Sakurai occupy the top spots by number of titles.

  • Rajiv Chilaka, Jan Suter and Raul Campos are the directors with the most titles on Netflix.

  • Among directors, Rajiv Chilaka has produced the most movies.

  • Netflix focuses more on movies than on TV shows.

  • There is a roughly 70:30 ratio of movies to TV shows on the Netflix platform.

  • International Movies, Dramas and Comedies are the genres with the most titles.

  • The US, India, the UK, Canada and France are the leading countries in content creation on Netflix.

  • The most common rating category on Netflix is the one intended for mature audiences.

  • The most common duration in the whole data is 80-120 minutes. The leading categories must be movies, along with shows that have only 1 season.

  • The United States leads in both TV shows and movies, and the UK also contributes a great deal of content in both categories. Surprisingly, India is much more prevalent in movies than in TV shows.

  • Moreover, the number of movies created in India outweighs the combined number of TV shows and movies from the UK, which is why India ranks second in total content across Netflix.

Recommendations¶

  • The most common genres across countries, for both TV shows and movies, are Dramas, Comedies and International TV Shows/Movies, so it is recommended to produce more content in these genres.

  • Add TV shows and movies on July 1st or August 1st.

  • Add more movies for the Indian audience, since their number has been declining since 2018.

  • When creating content, take into consideration the popular actors and directors for that country, as well as director-actor combinations that have worked well.

  • For movies, 80-120 minutes is the recommended length.